The University of Maryland at College Park (UMD) is a flagship university at Maryland and is one of the top universities in computer science. We decided to take a look into their annual budget and salary reports to find any trends. In our project, we have pulled data from multiple sources for our analysis. Each source has it's own dataframe and has been retrieved in it's own unique way. Therefore, in the below descriptions, each dataframe has a link that links to a IPython notebook file that contains the steps required to produce that dataframe. All the data was pulled from either the Department of Budget and Fiscal Analaysis or UMD's salary guide. Each dataframe will be described below with a short description and a graph describing it's contents. We have written this project in Python 3 and have used a number of libraries including: pandas, numpy, matplotlib, scikit-learn, and Beautiful Soup.
# Required dependencies for this project
import requests
import pandas as pd
from bs4 import BeautifulSoup
import pickle
import plotly.express as px
import numpy as np
import copy
import matplotlib.pyplot as plt
from matplotlib.pyplot import figure
import matplotlib.ticker as mtick
from matplotlib.pyplot import cm
All the dataframes used in this project are in the below dataframes.
expenses = pd.read_pickle('df/expenses')
revenue = pd.read_pickle('df/revenue')
salary_increases = pd.read_pickle('df/salary_increases')
tuition_fees = pd.read_pickle('df/tuition_fees')
salaries = pd.read_pickle('df/salaries')
Below are some functions used to render plots.
def plot_line_graphs_tuition(tuition_fees, features, title, figsize):
plt.style.use('seaborn-whitegrid')
_, ax = fig, ax = plt.subplots(1, 1, figsize=figsize, facecolor='w')
color=iter(cm.rainbow(np.linspace(0, 1, len(features))))
for feature in features:
c = next(color)
plt.plot(tuition_fees['Year'], tuition_fees[feature], marker='', linewidth=5, alpha=0.9, label=feature, c=c)
fmt = '${x:,.0f}'
tick = mtick.StrMethodFormatter(fmt)
ax.yaxis.set_major_formatter(tick)
ax.tick_params(axis='both', which='major', pad=30)
plt.title(title, fontsize=70, pad=20)
plt.xlabel("Year", fontsize=50, labelpad=20)
plt.ylabel("$ in Dollars", fontsize=50, labelpad=20)
plt.xticks(fontsize=30)
plt.yticks(fontsize=30)
plt.legend(loc='center left', bbox_to_anchor=(1, 0.5), prop={'size': 30})
plt.show()
def plot_line_graphs_expenses(features, title, figsize):
plt.style.use('seaborn-whitegrid')
_, ax = fig, ax = plt.subplots(1, 1, figsize=figsize, facecolor='w')
color=iter(cm.rainbow(np.linspace(0, 1, len(features))))
for feature in features:
c = next(color)
plt.plot(sorted(list(expenses['Year'].unique())), features[feature], marker='', linewidth=5, alpha=0.9, label=feature, c=c)
fmt = '${x:,.0f}'
tick = mtick.StrMethodFormatter(fmt)
ax.yaxis.set_major_formatter(tick)
ax.tick_params(axis='both', which='major', pad=30)
plt.title(title, fontsize=70, pad=20)
plt.xlabel("Year", fontsize=50, labelpad=20)
plt.ylabel("$ in Dollars", fontsize=50, labelpad=20)
plt.xticks(fontsize=30)
plt.yticks(fontsize=30)
plt.legend(loc='center left', bbox_to_anchor=(1, 0.5), prop={'size': 30})
plt.show()
def plot_line_graphs_revenue(features, title, figsize):
plt.style.use('seaborn-whitegrid')
_, ax = fig, ax = plt.subplots(1, 1, figsize=figsize, facecolor='w')
color=iter(cm.rainbow(np.linspace(0, 1, len(features))))
for feature in features:
c = next(color)
plt.plot(sorted(list(revenue['Year'].unique())), features[feature], marker='', linewidth=5, alpha=0.9, label=feature, c=c)
fmt = '${x:,.0f}'
tick = mtick.StrMethodFormatter(fmt)
ax.yaxis.set_major_formatter(tick)
ax.tick_params(axis='both', which='major', pad=30)
plt.title(title, fontsize=70, pad=20)
plt.xlabel("Year", fontsize=50, labelpad=20)
plt.ylabel("$ in Dollars", fontsize=50, labelpad=20)
plt.xticks(fontsize=30)
plt.yticks(fontsize=30)
plt.legend(loc='center left', bbox_to_anchor=(1, 0.5), prop={'size': 30})
plt.show()
The tuition_fees dataframe contains tuition rates for undergraduate, graduate, out-of-state, and in-state students over time. If you would like to learn more about how this dataframe was created, please click here: https://adharsh.github.io/Tuition_and_Fees.html
tuition_fees
figsize = (20, 12)
title = "Mandatory Tuition Fees Over Time"
features = list(tuition_fees.columns)[7:11]
plot_line_graphs_tuition(tuition_fees, features, title, figsize)
From the above graph, it seems like full-time students pay the most money in terms of mandatory fees (which makes sense) of which undergrads pay significantly more. There is a big variation from how much the undergraduate and graduate students pay when studying full-time than when they study part-time. For part-time students, it looks like they pay basically the same amount.
figsize = (20, 12)
title = "Standard Tuition Fee Rate Per Credit Hour Over Time"
features = list(tuition_fees.columns)[3:7]
plot_line_graphs_tuition(tuition_fees, features, title, figsize)
This graph is very interesting as it shows how the university charges in-state and out-of-state students. Out-of-state students get charged substantially more, and over time, you can see that the university keeps charging them more and more. In-state students' tuitions have increased at a much smaller rate. In-state and out-of-state graduate students pay signifcantly more than undergraduate students, which makes sense since graduate programs are considerd "cash cow" programs for universities in general.
The expenses dataframe contains data on how the university spends it's money by category and department. If you would like to learn more about how this dataframe was created, please click here: https://adharsh.github.io/Expenses.html
expenses
Each expense has a 'Group Type'. For each 'Group Type', there is a list of 'Group' data.
expenses.groupby(by=['Group Type', 'Group']).count()
As shown above, expenses can be split by 'Program' or 'Category'. Both 'Program' and 'Category' have a 'Total Expenditures' attribute. This is the sum of all the other attributes in that group, and the 'Total Expenditures' for a 'Program' and a 'Category' in a given year will be equal to each other. They're just two different ways of partitioning the expenses and they both sum to the same total expenses. 'Total Expenditure Amount' is not the same as 'Total Expenditures'. 'Total Expenditure Amount' is simply the amount of money spent in a particular group. There are nine unique values for the year column as shown above since this dataset ranges from 2012 to 2020.
# The groups unique to Category:
expenses.loc[expenses['Group Type'] == 'Category']['Group'].unique()
# The groups unique to Program:
expenses.loc[expenses['Group Type'] == 'Program']['Group'].unique()
The below code was written to set up the following graphs:
group_list = expenses['Group'].unique()
features_array = []
for feature in group_list:
features_array.append(expenses.loc[expenses['Group'] == feature].sort_values(by='Year', ascending=True)['Total Expenditure Amount'].values)
orig_features = dict(zip(group_list, features_array))
features = copy.deepcopy(orig_features)
del features['Total Expenditures']
total_expenditures = expenses.loc[(expenses['Group Type'] == 'Category') & (expenses['Group'] == 'Total Expenditures')].sort_values(by='Year', ascending=True)['Total Expenditure Amount'].values
figsize = (20, 12)
title = "Total Expenses Over Time"
plot_line_graphs_expenses({'Total Expenditures' : total_expenditures}, title, figsize)
It's interesting to see how UMD's has more than $\$2.2$ billion just in expenses. It makes you wonder how much revenue they make to counter this expense. This is answered below!
categories = list(expenses.loc[expenses['Group Type'] == 'Category']['Group'].unique())
del categories[categories.index('Total Expenditures')]
programs = list(expenses.loc[expenses['Group Type'] == 'Program']['Group'].unique())
del programs[programs.index('Total Expenditures')]
categories_dict = {k:orig_features[k] for k in orig_features if k in categories}
figsize = (20, 12)
title = "Expenses by Category Over Time "
plot_line_graphs_expenses(categories_dict, title, figsize)
By category, it seems that paying off 'Salaries & Fringes' (benefits and other componesation) is the most expensive expense. After this, 'Fixed Charges, Debt Service, Rentals & Contracts' is the most costly expense for UMD. UMD is probably trying to reduce this expense out of all the expenses to maximize their net profit.
programs_dict = {k:orig_features[k] for k in orig_features if k in programs}
figsize = (20, 12)
title = "Expenses by Program Over Time "
plot_line_graphs_expenses(programs_dict, title, figsize)
When split by program, unsuprisingly, research and instruction are the most expensive expenses. This makes sense since UMD has a top-ranking computer science and physics department that both spend heavily on research. The 'Instruction' expense would probably be the sum of all the salaries of lecturers (not research professors).
The revenue dataframe contains data on how the university gets it's money either through 'Unrestricted' or 'Restricted' means. If you would like to learn more about how this dataframe was created, please click here: https://adharsh.github.io/Revenue.html
revenue
revenue.groupby(by=['Restricted Type', 'Revenue Type']).count()
Similar to expenses, revenue can be split by 'Unrestricted' or 'Restricted'. It can also be split by 'Total', and this occurs only once a year and contains the total revenue for that year and is the sum of both the 'Restricted' and 'Unrestricted' data for that year. This 'Total' only has one 'Revenue Type' called 'Total Operating Budget'. Each revenue type has an 'Amount' of money that is some percentage of the total money in that restricted type category. This percentage is described in '% of Total'. The total money in that restricted type category is a 'Revenue Type' called 'Total Restricted' or 'Total Unrestricted'. There are nine unique values for the year column as shown above since this dataset ranges from 2012 to 2020.
The code below was used to create the following revenue graphs:
total_revenue = revenue.loc[revenue['Revenue Type'] == 'Total Operating Budget'].sort_values(by='Year', ascending=True)['Amount'].values
figsize = (20, 12)
title = "Total Revenue Over Time"
plot_line_graphs_expenses({'Total Revenue' : total_revenue}, title, figsize)
It's interesting to note how high the revenue is increasing. It also looks very similar to the expenses... (keep reading below!)
The code below will create revenue plots:
group_list = revenue['Revenue Type'].unique()
features_array = []
for feature in group_list:
features_array.append(revenue.loc[revenue['Revenue Type'] == feature].sort_values(by='Year', ascending=True)['Amount'].values)
features = dict(zip(group_list, features_array))
unrestricted = list(revenue.loc[revenue['Restricted Type'] == 'Unrestricted']['Revenue Type'].unique())
restricted = list(revenue.loc[revenue['Restricted Type'] == 'Restricted']['Revenue Type'].unique())
unrestricted_dict = {k:features[k] for k in unrestricted if k in features}
figsize = (20, 12)
title = "Unrestricted Revenue Over Time "
plot_line_graphs_expenses(unrestricted_dict, title, figsize)
In unrestricted revenue, it seems like the university get's most of it's money from tuition and fees as well as state appropriations (probably through taxes). They're both really similar so that that means that after summing all the money from every student paying their tuition, there is a nearly equal amount of money given from the government to the university. This is quite a lot. As a result, you can probably see a lot of government influence at UMD. For example, ACES (a cybersecurity honors program at UMD) is tied very closely with Northrop Grumman, an American aerospace and defense company.
restricted_dict = {k:features[k] for k in restricted if k in features}
figsize = (20, 12)
title = "Restricted Revenue Over Time "
plot_line_graphs_expenses(restricted_dict, title, figsize)
It seems like federal grants and contracts make the most revenue in restricted revenue. For example, at the Joint Quantum Institute (a quantum computing research group at UMD), some labs are sponsored by the U.S. Army Research Lab.
The salary_increases dataframe contains data on how much salaries have been generally increasing from 1967 to 2019. This is percentage based and is based on the state cost of living adjusted percentage or 'State C.O.L.A. %' and the merit percentage 'Merit %' (which is like bonus compensation). These two percentages added up equal to the 'Total Increase %'. If you would like to learn more about how this dataframe was created, please click here: https://adharsh.github.io/Salary_Increases.html
salary_increases.loc[salary_increases["State C.O.L.A. %"] > 100, "State C.O.L.A. %"] = 0
salary_increases = salary_increases.drop(["Total Increase %"], 1)
salary_increases.tail()
# Compute the running sums of salary increase and cost of living over the years.
merit_pct = np.cumsum(np.flipud(np.array(salary_increases["Merit %"])))
state_cola = np.cumsum(np.flipud(np.array(salary_increases["State C.O.L.A. %"])))
# Plot salary increase vs. cost of living over the years.
figure(figsize=[17, 10])
plt.title("Comparing how Salaries have changed with Cost of Living (UMD Area, 1967-2019)", fontsize=20)
plt.xlabel("Year", fontsize=15)
plt.ylabel("Cumulative Increase by Percentage since 1967", fontsize=15)
years = [i for i in range(1967, 2020)]
plt.plot(years, merit_pct, color="green", linewidth=3, label="Cumulative Salary Increase");
plt.plot(years, state_cola, color="red", linewidth=3, label="Cumulative Cost of Living Increase");
plt.legend(loc="upper left", prop = {"size" : 12});
This double-line graph details how the salary and cost of living increases for Maryland employees. We initially chose the relationship between these two because we felt it would be interesting to see how well Maryland employers are able to provide benefits for their employees as the economy changes. We did not know what to expect before running this analysis, and after the visualization we can see that employees' salaries have generally increased over time as the state cost of living has also gone up. Specifically, the graph details the cumulative percentage increase of both of these measures over the years with 1967 as the zero point. We can see that both average salary and cost of living have a little more than doubled since 1967, with a cumulative percent increase of ~136% in average cost of living and ~124% in average salary. It is worth noting that there is about a 10% gap between cost of living and salary increase that started aroudn 1980 and continues through to today. However, if we look at similar trends in different areas across the nation, we can see that the trend shown by Maryland employees is one of the best for economic health. According to this Pew Research Study, on average across the country the purchasing power and benefits an average person receives have actually gone down as the cost of living goes up. This is because even though their salary may be increasing, it is not increasing as fast as the cost of living is increasing. However, from this plot we can see that even though there is a gap between cost of living and salary increase, this gap is not really widening over the years and has stayed relatively constant since 1980. This is an indication that employees in Maryland are matching their employees' salaries to the rate of increase in cost of living, ensuring that their jobs will still be enough to sustain them in an economy that is constantly shifting.
The salaries dataframe contains data on the salary for each faculty member from 2013 to 2019. Each 'Employee' has a job 'Title' and belongs to a 'Divison', which is part of a 'Department'. This 'Department' is part of a 'School'. The 'School' is just the first four or five characters of the 'Department' string. If you would like to learn more about how this dataframe was created, please click here: https://adharsh.github.io/Salaries.html
salaries = pd.read_pickle('salaries')
salaries.loc[salaries['Employee'] == 'Abed, Eyad H']
salaries.head(10)
For the following plots, no legend is provided (since it would be too long). Simply hover your mouse over the point (not the line but the point which intersects with the year) to get the hover text.
fig = px.line(salaries.groupby(['Year','Department']).sum().reset_index(),
x="Year", y="Salary", color="Department", hover_name="Department")
fig.update_layout(
showlegend=False,
title="Total Salaries for Each Department Over Time",
xaxis_title="Year",
yaxis_title="$ in Dollars",
font=dict(
family="Courier New, monospace",
size=20,
color="#000000"
)
)
fig.show()
If the data is grouped by department, it looks like the physics and athletics departments are the most heavily funded. However, a lot of these departments should be overlapping. For example, there are several unique CMNS departments because CMNS spans so many subjects. To combine these departments together, we can examine this graph by school, where CMNS would be a single line graph.
fig = px.line(salaries.groupby(['Year','School']).sum().reset_index(),
x="Year", y="Salary", color="School", hover_name="School")
fig.update_layout(
showlegend=False,
title="Total Salaries for Each School Over Time",
xaxis_title="Year",
yaxis_title="$ in Dollars",
font=dict(
family="Courier New, monospace",
size=20,
color="#000000"
)
)
fig.show()
As expected, CMNS is the most funded school since it contains so many departments.
fig = px.line(salaries.groupby(['Year','School']).mean().reset_index(),
x="Year", y="Salary", color="School", hover_name="School")
fig.update_layout(
showlegend=False,
title="Mean Salaries for Each School Over Time",
xaxis_title="Year",
yaxis_title="$ in Dollars",
font=dict(
family="Courier New, monospace",
size=20,
color="#000000"
)
)
fig.show()
Now at the individual level, it looks like the highest salaries for any professor of a department on average are from BMGT and PRES. So basically, administration and business professsors on average earn the most.
fig = px.line(salaries.groupby(['Year','Department']).max().reset_index(),
x="Year", y="Salary", color="Department", hover_name="Employee")
fig.update_layout(
showlegend=False,
title="Max Salary for Each Department Over Time",
xaxis_title="Year",
yaxis_title="$ in Dollars",
font=dict(
family="Courier New, monospace",
size=20,
color="#000000"
)
)
fig.show()
Now let's go even more specific and revert back to departments. Let's find out the max salary for each department over time. As you can see, each data point for a single line can represent a different employee. However, the line will still be in the same department. The top departments here are administration and athletics and it's also interesting to note how the athletics department skyrockets to nearly 900k a year.
figsize = (20, 12)
plt.style.use('seaborn-whitegrid')
_, ax = fig, ax = plt.subplots(1, 1, figsize=figsize, facecolor='w')
color=iter(cm.rainbow(np.linspace(0, 1, len(features))))
features = {'Total Revenue' : total_revenue, 'Total Expenditures' : total_expenditures}
for feature in features:
plt.plot(list(range(2012, 2021)), features[feature], marker='', linewidth=5, alpha=0.9, label=feature)
fmt = '${x:,.0f}'
tick = mtick.StrMethodFormatter(fmt)
ax.yaxis.set_major_formatter(tick)
ax.tick_params(axis='both', which='major', pad=30)
plt.title('Revenue & Expenses Over Time', fontsize=70, pad=20)
plt.xlabel("Year", fontsize=50, labelpad=20)
plt.ylabel("$ in Dollars", fontsize=50, labelpad=20)
plt.xticks(fontsize=30)
plt.yticks(fontsize=30)
plt.legend(loc='center left', bbox_to_anchor=(1, 0.5), prop={'size': 30})
plt.show()
The revenue and the expenses (from those respective dataframes) are practically the same, which implies their difference is zero. This means that all the gross revenue is distributed evenly back to the expenses. For example, the extra revenues would go into university projects like building more campus dorms or redistributed as scholarships for students or grants for professors which would all count as an expense.
figsize = (20, 12)
plt.style.use('seaborn-whitegrid')
_, ax = fig, ax = plt.subplots(1, 1, figsize=figsize, facecolor='w')
color=iter(cm.rainbow(np.linspace(0, 1, len(features))))
net_profit = [a_i - b_i for a_i, b_i in zip(total_revenue, total_expenditures)]
plt.plot(list(range(2012, 2021)), net_profit, marker='', linewidth=5, alpha=0.9, label=feature)
fmt = '${x:,.0f}'
tick = mtick.StrMethodFormatter(fmt)
ax.yaxis.set_major_formatter(tick)
ax.tick_params(axis='both', which='major', pad=30)
plt.title('Net Profit Over Time', fontsize=70, pad=20)
plt.xlabel("Year", fontsize=50, labelpad=20)
plt.ylabel("$ in Dollars", fontsize=50, labelpad=20)
plt.xticks(fontsize=30)
plt.yticks(fontsize=30)
plt.legend(loc='center left', bbox_to_anchor=(1, 0.5), prop={'size': 30})
plt.show()
Since we thought that the difference between revenues and expenses is zero, here is the plot of the difference. It seems like it's not zero in the year of 2019. Maybe this is a data recording error since at the time this was written, the year 2019 had not passed yet. Therefore, we can assume that the difference would actually be zero and our inferences in the above paragraph still hold true.
total_salaries = list(salaries.groupby(by='Year').sum()['Salary'].values)
total_expenses = list(expenses.loc[(expenses['Group Type'] == 'Program') & (expenses['Group'] == 'Total Expenditures')].sort_values(by='Year', ascending=True)[1:-1]['Total Expenditure Amount'].values)
categories = list(expenses.loc[expenses['Group Type'] == 'Category']['Group'].unique())
salaries_fringes = list(expenses.loc[expenses['Group'] == 'Salaries & Fringes'].sort_values(by='Year', ascending=True)[1:-1]['Total Expenditure Amount'].values)
fringes = [a_i - b_i for a_i, b_i in zip(salaries_fringes, total_salaries)]
# fringes = [a_i - b_i for a_i, b_i in zip(total_salaries, salaries_fringes)]
figsize = (20, 12)
plt.style.use('seaborn-whitegrid')
_, ax = fig, ax = plt.subplots(1, 1, figsize=figsize, facecolor='w')
color=iter(cm.rainbow(np.linspace(0, 1, len(list(range(2013, 2019))))))
plt.plot(list(range(2013, 2019 + 1)), total_salaries, marker='', linewidth=5, alpha=0.9, label='Total Salaries')
plt.plot(list(range(2013, 2019 + 1)), total_expenses, marker='', linewidth=5, alpha=0.9, label='Total Expenses')
plt.plot(list(range(2013, 2019 + 1)), salaries_fringes, marker='', linewidth=5, alpha=0.9, label='Salaries & Fringes')
plt.plot(list(range(2013, 2019 + 1)), fringes, marker='', linewidth=5, alpha=0.9, label='Fringes')
fmt = '${x:,.0f}'
tick = mtick.StrMethodFormatter(fmt)
ax.yaxis.set_major_formatter(tick)
ax.tick_params(axis='both', which='major', pad=30)
ax.set_ylim(ymin=0)
plt.title('Comparing Total Salaries and Total Expenses Over Time', fontsize=70, pad=20)
plt.xlabel("Year", fontsize=50, labelpad=20)
plt.ylabel("$ in Dollars", fontsize=50, labelpad=20)
plt.xticks(fontsize=30)
plt.yticks(fontsize=30)
plt.legend(loc='center left', bbox_to_anchor=(1, 0.5), prop={'size': 30})
plt.show()
We figured out how much money was spent for 'Salaries and Fringes' from the expenses dataframe and we can figure out the total salaries of all the faculty per year from the salaries dataframe. The result is shown above. 'Total Salaries' is the sum of all the salaries per year from the salaries dataframe. 'Salaries and Fringes' and 'Total Expenses' are from the expenses dataframe. 'Fringes' is the result of subtracting 'Total Salaries' from 'Salaries and Fringes'. As you can see from the above plot, 'Fringes' (or benefit compensation) account for nearly $\$500k$ a year.
from sklearn.linear_model import LinearRegression
from sklearn import linear_model
num_future_years = 10
pred_tuition_fees = copy.deepcopy(tuition_fees)
start_year = int(tuition_fees['Year'].max()) + 1
tuition_slopes = {}
cols = list(tuition_fees.columns)
for yr in range(start_year, start_year + num_future_years + 1):
pred_tuition_fees.append({'Year' : str(yr)}, ignore_index=True)
for feature in cols[1:]:
X = tuition_fees[cols[0]].values.reshape(-1, 1)
Y = tuition_fees[feature].values.reshape(-1, 1)
reg = LinearRegression().fit(X, Y)
tuition_slopes[feature] = reg.coef_[0][0]
for yr in range(start_year, start_year + num_future_years + 1):
pred_tuition_fees.loc[pred_tuition_fees['Year'] == str(yr), feature] = reg.predict(X)[0][0]
features = {k:pred_tuition_fees[k] for k in cols[1:] if k in pred_tuition_fees}
figsize = (20, 12)
title = "Tuition Rate Prediction Over Time "
plot_line_graphs_tuition(pred_tuition_fees, features, title, figsize)
pickle.dump(tuition_slopes, open('finds/tuition_slopes.pkl', 'wb'), protocol=pickle.HIGHEST_PROTOCOL)
tuition_slopes
Linear regression has been performed on each feature shown above to see the average increase in dollars per year. As shown from the above plot and dictionary output, out-of-state students pay a lot! On average, an undergraduate out-of-state student's tuition increases by $\$1,241.53$ a year whereas an in-state undergraduate student's tuition increases by only $\$240.26$ a year! Similar conclusions can be made with graduate students as well. It's very striking how much UMD charges out-of-state students.
num_future_years = 10
pred_salaries = copy.deepcopy(salaries)
start_year = int(tuition_fees['Year'].max()) + 1
salary_slopes = {}
# for yr in range(start_year, start_year + num_future_years + 1):
# pred_salaries.append({'Year' : str(yr)}, ignore_index=True)
for feature in salaries['Employee'].unique():
X = salaries.loc[salaries['Employee'] == feature]['Year'].values.reshape(-1, 1)
Y = salaries.loc[salaries['Employee'] == feature]['Salary'].values.reshape(-1, 1)
reg = LinearRegression().fit(X, Y)
salary_slopes[feature] = reg.coef_[0][0]
# for yr in range(start_year, start_year + num_future_years + 1):
# pred_tuition_fees.loc[pred_tuition_fees['Year'] == str(yr), feature] = reg.predict(X)[0][0]
pickle.dump(salary_slopes, open('finds/salary_slopes.pkl', 'wb'), protocol=pickle.HIGHEST_PROTOCOL)
{k: salary_slopes[k] for k in list(salary_slopes)[:10]}
len(salaries['Employee'].unique())
The code above performs linear regression on every faculty member that has existed from 2013 to 2019 to find his/her average increase in salary in dollars per year. As shown by the number directly above, this is about 2,2076 linear regressions, and the code might take some time to run.